Abstract. Compressive sensing (CS) is a new approach for the acquisition and recovery of 
sparse signals and images that enables sampling rates significantly below the classical Nyquist rate. 
Despite significant progress in the theory and methods of CS, little headway has been made in 
compressive video acquisition and recovery. Video CS is complicated by the ephemeral nature of 
dynamic events, which makes direct extensions of standard CS imaging architectures and signal 
models difficult. In this paper, we develop a new framework for video CS for dynamic textured 
scenes that models the evolution of the scene as a linear dynamical system (LDS). This reduces 
the video recovery problem to first estimating the model parameters of the LDS from compressive 
measurements, and then reconstructing the image frames. We exploit the low-dimensional dynamic 
parameters (the state sequence) and high-dimensional static parameters (the observation matrix) 
of the LDS to devise a novel compressive measurement strategy that measures only the dynamic 
part of the scene at each instant and accumulates measurements over time to estimate the static 
parameters. This enables us to lower the compressive measurement rate considerably. We validate 
our approach with a range of experiments involving video recovery, sensing of hyper-spectral data, 
and classification of dynamic scenes from compressive data. Together, these applications demonstrate 
the effectiveness of the approach. 

Key words. Compressive sensing, Linear dynamical system, Video compressive sensing, 
Hyper-spectral imaging 

1. Introduction. The Shannon-Nyquist theorem dictates that to sense features 
at a particular frequency, we need to sample uniformly at twice that rate. For some 
applications, this sampling rate might be too high and/or redundant; in modern digital 
cameras, invariably, the sensed image is compressed immediately without much loss 
in quality. For some applications, such as high speed imaging and sensing in the non- 
visual spectrum, camera/sensor designs based on the Shannon-Nyquist theorem lead 
to impractical and costly designs. Part of the reason for this is that the Shannon- 
Nyquist sampling theory does not exploit any structure in the sensed signal beyond 
that of band-limitedness. Signals with redundant structures can potentially be sensed 
more parsimoniously. This is the key idea underlying a new field called compressive 
sensing (CS) [7]. When the signal of interest exhibits a sparse representation, CS 
enables sensing at measurement rates below the Nyquist rate. Indeed, signal recovery 
is possible from a number of measurements that is proportional to the sparsity level 
of the signal, as opposed to its bandwidth. 

In this paper, we consider the problem of sensing videos compressively. Our 
interest in this problem is motivated by the success of video compression algorithms, 
which indicates that videos are highly redundant. Bridging the gap between compression 
and sensing can lead to compelling camera designs that significantly reduce the 
amount of data sensed and enable designs for application domains where sensing is 
inherently costly. 

Video CS is challenging for two main reasons: 

• Ephemeral nature of videos: The scene changes during the measurement 
process; moreover, we cannot obtain additional measurements of an event 
after it has occurred. 

• High-dimensional signals: Videos are significantly higher-dimensional 
than images. This makes the recovery process computationally intensive. 

One way to address these challenges is to restrict the scope to estimation of parametric 
models that are suitable for a broad class of videos. 

In this paper, we develop a CS framework for videos modeled as linear dynamical 
systems (LDSs), motivated, in part, by the extensive use of such models in 
characterizing dynamic textures [11, 15, 32], activity modeling, and video clustering [35]. 
Parametric models, like LDSs, offer lower-dimensional representations for otherwise 
high-dimensional videos. This restricts the number of free parameters that need to be 
estimated and, as a consequence, reduces the amount of data that needs to be sensed. 
In the context of video sensing, LDSs offer interesting tradeoffs by characterizing 
the video signal using a mix of dynamic/time-varying parameters and static/time-invariant 
parameters. Further, the generative nature of LDSs provides a prior for 
the evolution of the video in both forward and reverse time. To a large extent, this 
property helps us circumvent the challenges presented by the ephemeral nature of 
videos. 

The paper makes the following contributions. We propose a framework called 
CS-LDS for video acquisition using an LDS model coupled with sparse priors for the 
parameters of the LDS model. The core of the framework is a two-step measurement 
strategy that enables the recovery of the LDS parameters from compressive measure- 
ments by solving a sequence of linear and convex problems. We demonstrate that 
CS-LDS is capable of sensing videos and hyper-spectral data with far fewer 
measurements than the Nyquist rate. Finally, the LDS parameters form an important class 
of features for activity recognition and scene analysis, thereby making our camera 
designs purposive as well. 

2. Background. 

2.1. Compressive sensing. CS deals with the recovery of a signal $y \in \mathbb{R}^N$ from 
undersampled linear measurements of the form $z = \Phi y + e$, where $\Phi \in \mathbb{R}^{M \times N}$ is the 
measurement matrix, $M < N$, and $e$ is the measurement noise [7, 14]. Estimating $y$ 
from the measurements $z$ is ill-posed, since the linear system formed by $z = \Phi y$ 
is under-determined. CS works under the assumption that the signal $y$ is sparse 
in a basis $\Psi$; that is, the signal $s$, defined by $y = \Psi s$, has at most $K$ non-zero 
components. Exploiting the sparsity of $s$, the signal $y$ can be recovered exactly from 
$M = O(K \log(N/K))$ measurements provided the matrix $\Phi\Psi$ satisfies the so-called 
restricted isometry property (RIP) [4]. In particular, when $\Psi$ is an orthonormal basis 
and the entries of the matrix $\Phi$ are i.i.d. samples from a sub-Gaussian distribution, 
the product $\Phi\Psi$ satisfies the RIP with high probability. Further, the signal $y$ can be 
recovered from $z$ by solving a convex problem of the form 

    $$\min \|s\|_1 \quad \text{subject to} \quad \|z - \Phi\Psi s\|_2 \le \epsilon, \qquad (1)$$

where $\epsilon$ is an upper bound on the measurement noise $e$. It can be shown that the 
solution to (1) is, with high probability, the $K$-sparse solution that we seek. The 
theoretical guarantees of CS have been extended to compressible signals, where the 
sorted coefficients of $s$ decay rapidly according to a power law [21]. 

There exists a wide range of algorithms to solve (1) under various approximations 
or reformulations [7, 36]. Greedy techniques such as Orthogonal Matching Pursuit [27] 
and CoSAMP [25] solve the sparse approximation problem efficiently with strong 
convergence properties and low computational complexity. It is also simple to impose 
structural constraints such as block sparsity in CoSAMP, giving variants such as 
model-based CoSAMP [3]. 
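
As an illustration of the greedy recovery idea, the following is a minimal sketch of Orthogonal Matching Pursuit on a synthetic problem. The problem sizes, the random seed, the identity sparsifying basis, and the function name `omp` are assumptions chosen for the example; they are not taken from the experiments in this paper.

```python
import numpy as np

def omp(Theta, z, K):
    """Greedy K-sparse approximation of z ~ Theta @ s (Orthogonal Matching Pursuit)."""
    N = Theta.shape[1]
    residual = z.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(K):
        # Select the column most correlated with the current residual.
        correlations = np.abs(Theta.T @ residual)
        if support:
            correlations[support] = 0.0   # never select an index twice
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(Theta[:, support], z, rcond=None)
        residual = z - Theta[:, support] @ coef
    s = np.zeros(N)
    s[support] = coef
    return s

# Hypothetical toy problem: K-sparse signal, Gaussian measurement matrix.
rng = np.random.default_rng(0)
N, M, K = 256, 64, 5
Psi = np.eye(N)                                   # assumed sparsifying basis
Phi = rng.standard_normal((M, N)) / np.sqrt(M)    # sub-Gaussian measurement matrix
s_true = np.zeros(N)
s_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
z = Phi @ Psi @ s_true                            # noiseless measurements
s_hat = omp(Phi @ Psi, z, K)
print("recovery error:", np.linalg.norm(s_hat - s_true))
```

The least-squares re-fit at every iteration is what distinguishes OMP from plain matching pursuit; CoSAMP differs further in that it selects several indices per iteration and prunes the support afterwards.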

2.2. Video compressive sensing. In video CS, the goal is to sense a time-varying 
scene using compressive measurements of the form $z_t = \Phi_t y_t$, where $z_t$, $\Phi_t$, 
and $y_t$ are the compressive measurements, the measurement matrix, and the video 
frame at time $t$, respectively. Given the sequence of compressive measurements 
$z_{1:T} = \{z_1, z_2, \ldots, z_T\}$, our goal is to recover the video frames 
$y_{1:T} = \{y_1, y_2, \ldots, y_T\}$. There 
are currently two fundamentally different imaging architectures for video CS: the 
single pixel camera (SPC) and the programmable pixel camera. The SPC [16] uses a 
single sensing element or a small number of them. Typically, a photo-detector is used to 
obtain a single measurement at each time instant of the form $z_t = \phi_t^T y_t$, where $\phi_t$ 
is a pseudo-random vector of 0s and 1s. Typically, under an assumption of a slowly 
varying scene, consecutive measurements from the SPC are grouped as measurements 
of the same video frame. This assumption works only when the scene motion is small 
or when the number of measurements associated with a frame is small. The SPC 
provides complete freedom in the spatial multiplexing of pixels; however, there is no 
temporal multiplexing. In contrast, programmable pixel cameras [22, 30, 41] use a full 
frame sensor array; during each exposure of the sensor array, the shutter at each pixel 
is temporally modulated. This enables extensive temporal multiplexing but a limited 
amount of spatial multiplexing. A key advantage of SPC-based designs is that they 
can operate efficiently at wavelengths (such as the far infrared) that require exotic 
detectors; in such cases, building a full frame sensor can be prohibitively expensive. 
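
The SPC measurement process described above can be simulated in a few lines. This is only a sketch: the 16 x 16 scene, the brightness drift in `scene`, and the number of measurements grouped per frame are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16 * 16      # pixels per frame (hypothetical 16 x 16 scene)
M = 8            # consecutive SPC measurements grouped into one frame
T = 5            # number of frames

def scene(t):
    """Hypothetical slowly varying scene: a ramp whose brightness drifts with t."""
    return (1.0 + 0.01 * t) * np.linspace(0.0, 1.0, N)

measurements = []
for t in range(T):
    y_t = scene(t)                             # treated as static while M samples are taken
    Phi_t = rng.integers(0, 2, size=(M, N))    # each row is a pseudo-random 0/1 pattern
    z_t = Phi_t @ y_t                          # one photo-detector reading per pattern
    measurements.append(z_t)

Z = np.stack(measurements)                     # shape (T, M)
print(Z.shape)
```

Grouping `M` readings into one frame is exactly the slowly-varying-scene approximation: the larger `M` is, the more the true scene changes within a group.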

To date, recovery algorithms for the SPC have used various signal models to 
reconstruct the sensed scene. Wakin et al. [42] use 3D wavelets as the sparsifying 
basis for recovering videos from compressive measurements. Park and Wakin [26] use a 
coarse-to-fine estimation framework wherein the video, reconstructed at a coarse scale, 
is used to estimate motion vectors that are subsequently used to design dictionaries for 
reconstruction at a finer scale. Vaswani [38] and Vaswani and Lu [39] use a sequential 
framework that exploits the similarity of support of the signal between adjacent frames 
of a video. Under this model, a frame of video is reconstructed using a linear inversion 
over the support at the previous time instant and a small-scale CS recovery over the 
residue to detect components beyond the known support. Cevher et al. provide 
a CS framework for directly sensing innovations over a static scene, thereby enabling 
background subtraction from compressive measurements. 

2.3. Linear dynamical system model for video sequences. Linear dynamical 
systems (LDSs) represent an important class of parametric models for time-series 
data. A wide variety of spatio-temporal signals have been modeled as realizations 
of LDSs. These include dynamic textures [15], traffic scenes [11], and human 
activities [35]. 

Under an LDS model, the evolution of the video frames $y_t$ is described in terms 
of a hidden state space 

    $$y_t = C x_t + w_t, \qquad w_t \sim \mathcal{N}(0, R),$$
    $$x_{t+1} = A x_t + v_t, \qquad v_t \sim \mathcal{N}(0, Q),$$

where $x_t \in \mathbb{R}^d$ is the state vector at time $t$, $d$ is the dimension of the state space, 
$A \in \mathbb{R}^{d \times d}$ is the state transition matrix, $C \in \mathbb{R}^{N \times d}$ is the observation matrix, 
and $Q$ and $R$ are the process and observation noise covariance matrices, respectively. 
For the videos of interest in this paper, $d \ll N$. 
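
To make the generative model concrete, the sketch below draws frames from an LDS. The dimensions, the scaled orthogonal transition matrix, and the helper name `simulate_lds` are assumptions made for illustration; with the noise switched off, the frame matrix has rank at most $d$, which is the low-dimensional structure exploited later.

```python
import numpy as np

def simulate_lds(C, A, x0, T, q_std=0.0, r_std=0.0, rng=None):
    """Draw T frames from y_t = C x_t + w_t, x_{t+1} = A x_t + v_t (isotropic noise)."""
    rng = np.random.default_rng() if rng is None else rng
    N, d = C.shape
    X = np.zeros((d, T))
    Y = np.zeros((N, T))
    x = np.array(x0, dtype=float)
    for t in range(T):
        X[:, t] = x
        Y[:, t] = C @ x + r_std * rng.standard_normal(N)   # observation equation
        x = A @ x + q_std * rng.standard_normal(d)          # state transition
    return X, Y

rng = np.random.default_rng(0)
N, d, T = 400, 4, 100
C = rng.standard_normal((N, d))                    # hypothetical observation matrix
Q_orth, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = 0.99 * Q_orth                                  # stable transition matrix (assumed)
X, Y = simulate_lds(C, A, rng.standard_normal(d), T)
rank_Y = np.linalg.matrix_rank(Y, tol=1e-8)
print(Y.shape, rank_Y)
```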

LDSs are parameterized by the matrix pair $(C, A)$. Note that the choice of $C$ 
and the state sequence $x_{1:T}$ is unique only up to a $d \times d$ linear transformation given 
the inherent ambiguities in the notion of a state space. In particular, given any 
invertible $d \times d$ matrix $L$, the LDS defined by $(C, A)$ with the state sequence $x_{1:T}$ is 
equivalent to the LDS defined by $(CL, L^{-1}AL)$ with the state sequence 
$L^{-1}x_{1:T} = \{L^{-1}x_1, L^{-1}x_2, \ldots, L^{-1}x_T\}$. This lack of uniqueness has implications that we will 
touch upon later in Section 5. 
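
The equivalence of the two parameterizations can be checked numerically. In this sketch the matrices are random stand-ins, and the transformation is taken to be unit upper triangular only so that it is guaranteed invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, T = 50, 3, 20
C = rng.standard_normal((N, d))
A = 0.9 * np.linalg.qr(rng.standard_normal((d, d)))[0]
# Unit upper-triangular L: determinant 1, hence invertible (choice made for the example).
L = np.eye(d) + np.triu(rng.standard_normal((d, d)), k=1)
L_inv = np.linalg.inv(L)

C2, A2 = C @ L, L_inv @ A @ L          # transformed system (CL, L^{-1} A L)
x = rng.standard_normal(d)             # state of the original system
x2 = L_inv @ x                         # corresponding state of the transformed system
max_gap = 0.0
for _ in range(T):
    # Both systems must produce the same (noiseless) frame at every instant.
    max_gap = max(max_gap, float(np.max(np.abs(C @ x - C2 @ x2))))
    x, x2 = A @ x, A2 @ x2
print("largest frame difference:", max_gap)
```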

Given a video sequence, the most common approach to fitting an LDS model 
is to first estimate a lower-dimensional embedding of the observations via principal 
component analysis (PCA) and then learn the temporal dynamics captured in $x_t$, 
and equivalently $A$. The most popular model estimation algorithms are N4SID [37], 
PCA-ID [34], and expectation-maximization (EM) [11]. N4SID is a subspace 
identification algorithm that provides an asymptotically optimal solution for the model 
parameters. However, for large problems the computational requirements make this 
method prohibitive. PCA-ID [34] is a sub-optimal solution to the learning problem. 
It makes the assumption that filtering in space and time are separable, which makes 
it possible to estimate the parameters of the model very efficiently via PCA. The 
learning problem can also be posed as a maximum likelihood estimation of the model 
parameters that maximize the likelihood of the observations, which can be solved by 
EM. 

3. CS-LDS Architecture. We provide a high-level overview of our proposed 
framework for video CS; the goal here is to build a CS framework, implementable on 
the SPC, for videos that are modeled as LDSs. We flesh out the details in Sections 4 
and 5. This amounts to estimating the LDS parameters from compressive 
measurements, i.e., we seek to recover the model parameters $C$ and $x_{1:T}$ given compressive 
measurements of the form $z_t = \Phi_t y_t = \Phi_t C x_t$. We recall that $C$ is the time-invariant 
observation matrix of the LDS, and $y_t$ and $x_t$ are the video frame and the state 
at time $t$, respectively. The compressive measurements $z_{1:T}$ are hence expressed as 
bilinear terms in the unknown parameters $C$ and $x_{1:T}$. Handling bilinear unknowns 
typically requires non-convex optimization techniques, thereby invalidating 
conventional CS recovery algorithms. To avoid this, we propose a two-step sensing method 
that is specifically designed to address the bilinearity; we refer to this sensing method 
and its associated recovery algorithm as the CS-LDS framework [33]. 

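Measurement model: We summarize the CS-LDS measurement model as follows. 
At time $t$, we take two sets of measurements: 

    $$z_t = \begin{pmatrix} \check{z}_t \\ \tilde{z}_t \end{pmatrix} = \begin{bmatrix} \check{\Phi} \\ \tilde{\Phi}_t \end{bmatrix} y_t = \Phi_t y_t, \qquad (2)$$

where $\check{z}_t \in \mathbb{R}^{\check{M}}$ and $\tilde{z}_t \in \mathbb{R}^{\tilde{M}}$ such that the total number of measurements at each 
frame is $M = \check{M} + \tilde{M}$.¹ The measurement matrix $\Phi_t$ in (2) is composed of two distinct 
components: a time-invariant part $\check{\Phi}$ and a time-varying part $\tilde{\Phi}_t$. We denote by $\check{z}_t$ 
the common measurements and by $\tilde{z}_t$ the innovation measurements. 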



1 The SPC obtains only one measurement at each time instant. Multiple measurements for a 
video frame are obtained by grouping consecutive measurements from the SPC. When M is small 
compared to the sampling rate of the SPC, this is an acceptable approximation, especially for slowly 
varying scenes. 

We solve for the LDS parameters in two steps. First, we obtain an estimate of 
the state sequence using only the common measurements $\check{z}_{1:T}$. Second, we use this 
state sequence estimate to recover the observation matrix $C$ using the innovation 
measurements. 

State sequence estimation: We recover the state sequence $x_{1:T}$ using only the 
common measurements $\check{z}_{1:T}$. The key idea is that when $y_{1:T}$ form the observations 
of an LDS with system matrices $(C, A)$, the measurements $\check{z}_{1:T}$ form the observations 
of an LDS with system matrices $(\check{\Phi} C, A)$. Estimation of the state sequence now can 
be mapped to a simple exercise in system identification. In particular, an estimate of 
the state sequence can be obtained by the singular value decomposition (SVD) of the 
block-Hankel matrix 

    $$\mathrm{Hank}(\check{z}_{1:T}, d) = \begin{bmatrix} \check{z}_1 & \check{z}_2 & \cdots & \check{z}_{T-d+1} \\ \check{z}_2 & \check{z}_3 & \cdots & \check{z}_{T-d+2} \\ \vdots & \vdots & & \vdots \\ \check{z}_d & \check{z}_{d+1} & \cdots & \check{z}_T \end{bmatrix}. \qquad (3)$$

Given the SVD $\mathrm{Hank}(\check{z}_{1:T}, d) = U_H S_H V_H^T$, the state sequence estimate is given by 

    $$[\hat{x}_{1:T}] = S_H V_H^T.$$

In Section 4, we leverage results from system identification to analyze the properties 
of this particular estimate as well as characterize the number of common measurements $\check{M}$ 
required. 
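
The state-sequence step can be sketched as follows. The example assumes noiseless measurements, a known state dimension, and a Gaussian time-invariant measurement matrix; the helper names `block_hankel` and `estimate_states` are hypothetical. Because the state space is only defined up to an invertible linear map, the check at the end fits such a map between the true and the estimated states rather than comparing them directly.

```python
import numpy as np

def block_hankel(Z, d):
    """Z has shape (M, T); block row i of the result holds columns i, ..., T-d+i of Z."""
    M, T = Z.shape
    return np.vstack([Z[:, i:T - d + 1 + i] for i in range(d)])

def estimate_states(Z, d):
    """State estimate from the d leading singular values/vectors of the Hankel matrix."""
    H = block_hankel(Z, d)
    _, s, Vt = np.linalg.svd(H, full_matrices=False)
    return np.diag(s[:d]) @ Vt[:d, :]              # shape (d, T - d + 1)

rng = np.random.default_rng(0)
N, d, T, M_common = 100, 3, 50, 2                  # hypothetical sizes
C = rng.standard_normal((N, d))
A = 0.98 * np.linalg.qr(rng.standard_normal((d, d)))[0]
X = np.zeros((d, T))
X[:, 0] = rng.standard_normal(d)
for t in range(1, T):
    X[:, t] = A @ X[:, t - 1]                      # noiseless state evolution
Phi_common = rng.standard_normal((M_common, N))    # time-invariant measurement matrix
Z = Phi_common @ C @ X                             # common measurements

X_hat = estimate_states(Z, d)
X_ref = X[:, :T - d + 1]
G = X_hat @ np.linalg.pinv(X_ref)                  # best linear map from true to estimated states
fit_ok = np.allclose(G @ X_ref, X_hat)
print("estimate is a linear transform of the true states:", fit_ok)
```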

Observation matrix estimation: Given knowledge of the state sequence, the 
relationship between the observation matrix $C$ and the innovation measurements is linear, 
i.e., $\tilde{z}_t = \tilde{\Phi}_t C x_t$. In addition, $C$ is time-invariant. Hence, we can accumulate 
innovation measurements over a duration of time to stably reconstruct $C$. This significantly 
reduces the number of innovation measurements $\tilde{M}$ required at each frame. This is 
especially important in the context of sensing videos, since the scene changes as we 
acquire measurements. Hence, requiring fewer measurements for each reconstructed 
frame of the video implies less error due to motion blur. 

Using the estimates of the state sequence $\hat{x}_{1:T}$, we can recover $C$ by solving the 
following convex problem: 

    $$\min \sum_{i=1}^{d} \|\Psi^T c_i\|_1 \quad \text{s.t.} \quad \forall t,\ \|\tilde{z}_t - \tilde{\Phi}_t C \hat{x}_t\|_2 \le \epsilon, \qquad (4)$$

where $c_i$ denotes the $i$-th column of $C$ and $\Psi$ is a sparsifying basis for the columns 
of $C$. However, as we show later in Section 5.2, ambiguities in the estimation of the 
state sequence induce a structured sparsity pattern in the support of $C$. The convex 
program (4) can be modified to incorporate such constraints. In addition to this, in 
Section 5, we also propose a greedy alternative for solving a variant of the convex 
program. 

To summarize (see Figure 1), the two-step measurement process described in (2) 
enables a two-step recovery. First, we obtain an estimate of the state sequence using 
SVD on just the common measurements. Second, we use the state sequence estimate 
for recovering the observation matrix using a convex program. The details of these 
two steps are discussed in the next two sections. 

Fig. 1. Block diagram of the CS-LDS framework. 



4. Estimating the state sequence. In this section, we discuss methods to 
estimate the state sequence $x_{1:T}$ from the common compressive measurements $\check{z}_{1:T}$. In 
particular, we seek to establish sufficient conditions under which the state sequence can be 
estimated reliably. 

4.1. Observability of the state sequence. Consider the compressive 
measurements given by 

    $$\check{z}_t = \check{\Phi} y_t + \omega_t, \qquad (5)$$

where $\check{z}_t \in \mathbb{R}^{\check{M}}$ are the compressive measurements at time $t$, $\check{\Phi} \in \mathbb{R}^{\check{M} \times N}$ is the 
corresponding measurement matrix, and $\omega_t \in \mathbb{R}^{\check{M}}$ is the measurement noise. Note 
that $\check{\Phi}$ is time-invariant; hence, (5) is the part of the measurement model described in 
(2) relating to the common measurements. A key observation is that, when $y_{1:T}$ form 
the observations of an LDS defined by $(C, A)$, the compressive measurement sequence 
$\check{z}_{1:T}$ forms an LDS as well; that is, 

    $$\check{z}_t = \check{\Phi} y_t + \omega_t = \check{\Phi} C x_t + \omega'_t,$$
    $$x_t = A x_{t-1} + v_{t-1},$$

where $\omega'_t = \check{\Phi} w_t + \omega_t$ collects the observation noise and the measurement noise. 
The LDS associated with $\check{z}_{1:T}$ is parameterized by the system matrices $(\check{\Phi} C, A)$. 
Estimating the state sequence from the observations of an LDS is possible only when the 
LDS is observable [5]. Thus, it is important to consider the question of observability 
of the LDS parameterized by $(\check{\Phi} C, A)$.² 

Definition 4.1 (Observability of an LDS [5]). An LDS is observable if, for any 
possible state sequence, the current state can be estimated from a finite number of 
observations. 

Lemma 4.2 (Test for observability of an LDS [5]). An LDS defined by the 
system matrices $(C, A)$ and of state space dimension $d$ is observable if and only if the 
observability matrix 

    $$\mathcal{O}(C, A) = \begin{bmatrix} C^T & (CA)^T & (CA^2)^T & \cdots & (CA^{d-1})^T \end{bmatrix}^T \qquad (6)$$

is full rank. 
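
The rank test of Lemma 4.2 translates directly into code. The sketch below uses a small hand-picked system; the helper names and the example matrices are assumptions made purely for illustration.

```python
import numpy as np

def observability_matrix(C, A):
    """Stack C, CA, ..., CA^{d-1} row-wise, where d is the state dimension."""
    d = A.shape[0]
    blocks = [C]
    for _ in range(d - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)

def is_observable(C, A):
    """Rank test: observable iff the observability matrix has rank d."""
    return bool(np.linalg.matrix_rank(observability_matrix(C, A)) == A.shape[0])

A_demo = np.array([[1.0, 1.0],
                   [0.0, 1.0]])
C_first = np.array([[1.0, 0.0]])    # observes the first state, which is driven by the second
C_second = np.array([[0.0, 1.0]])   # observes only the second state; the first is never seen

print(is_observable(C_first, A_demo), is_observable(C_second, A_demo))
```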

A necessary condition for the observability of the LDS defined by $(\check{\Phi} C, A)$ is that 
the LDS defined by $(C, A)$ is observable. However, for the LDSs we consider in this 
paper, $N \gg d$; for such systems, we take the LDS defined by $(C, A)$ to be observable. Given this 
assumption, we consider the observability of the LDS parameterized by $(\check{\Phi} C, A)$ next. 

2 Observability of LDSs in the context of CS has been studied earlier by Wakin et al. [43], who 
consider the scenario when the observation matrix $C$ is randomly generated and the state vector $x_0$ 
at $t = 0$ is sparse. In contrast, the analysis we present is for a non-sparse state vector. 

Lemma 4.3. For $N > d$, the LDS defined by $(\check{\Phi} C, A)$ is observable, with high 
probability, if $\check{M} \ge d$ and the entries of the matrix $\check{\Phi}$ are sampled i.i.d. from a 
sub-Gaussian distribution. 

Proof. This is established by proving that $\mathrm{rank}(\check{\Phi} C) = d$ when $\check{M} \ge d$. Assume 
that $\mathrm{rank}(\check{\Phi} C) < d$, i.e., $\exists\, a \in \mathbb{R}^d$ such that $\check{\Phi} C a = 0$, $a \ne 0$. Let $\phi^T$ be a row of 
$\check{\Phi}$. The event that $\phi^T C a = 0$ is one of negligible probability when the elements of $\phi$ 
are assumed to be i.i.d. according to a sub-Gaussian distribution such as Gaussian or 
Bernoulli. Hence, with high probability $\mathrm{rank}(\check{\Phi} C) = d$ when $\check{M} \ge d$. □ 
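
A quick numerical check of the rank argument in the proof is sketched below. The dimensions and the two random ensembles (Gaussian and symmetric Bernoulli) are choices made for the example; a random observation matrix stands in for the one of an actual video.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
C = rng.standard_normal((N, d))         # stand-in observation matrix of rank d

def rank_of_compressed(M, kind):
    """Rank of (Phi @ C) for an M x N matrix Phi with i.i.d. sub-Gaussian entries."""
    if kind == "gaussian":
        Phi = rng.standard_normal((M, N))
    else:                                # symmetric Bernoulli (+1 / -1) entries
        Phi = rng.choice([-1.0, 1.0], size=(M, N))
    return int(np.linalg.matrix_rank(Phi @ C))

ranks = {(M, kind): rank_of_compressed(M, kind)
         for M in (3, 5, 8) for kind in ("gaussian", "bernoulli")}
for key, value in ranks.items():
    print(key, value)
```

With fewer than $d$ rows the product cannot reach rank $d$ on its own, which is why the cases with fewer common measurements than state dimensions rely on the state transition matrix as well.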

Observability is the key criterion for recovering the state sequence from the 
common measurements. When the LDS associated with the common measurements is 
observable, we can estimate the state sequence, up to a linear transformation, 
by factorizing the block-Hankel matrix $\mathrm{Hank}(\check{z}_{1:T}, d)$ in (3). In the absence of noise, 
$\mathrm{Hank}(\check{z}_{1:T}, d)$ can be written as 

    $$\mathrm{Hank}(\check{z}_{1:T}, d) = \mathcal{O}(\check{\Phi} C, A)\, [x_1 \;\; x_2 \;\; \cdots \;\; x_{T-d+1}].$$

Hence, when the observability matrix $\mathcal{O}(\check{\Phi} C, A)$ is full rank, we can recover the state 
sequence by factoring the Hankel matrix using the SVD. Suppose the SVD of the 
Hankel matrix is $\mathrm{Hank}(\check{z}_{1:T}, d) = U S V^T$. Then, the estimate of the state sequence is 
obtained by 

    $$[\hat{x}_{1:T-d+1}] = S_d V_d^T, \qquad (7)$$

where $S_d$ is the diagonal matrix containing the $d$ largest singular values in $S$, and $V_d$ 
is the matrix composed of the right singular vectors corresponding to these singular 
values. The estimate of the state sequence obtained from SVD differs from its true 
value by a linear transformation. This is a fundamental ambiguity that stems from 
the lack of uniqueness in the definition of the state space (see Section 2.3). The state 
sequence estimate in (7) can be improved, especially for high levels of measurement 
noise, by using the system identification techniques mentioned in Section 2.3. However, 
the simplicity of this estimate makes it amenable to further analysis. 

When $\check{M} > d$, we can choose to factorize a smaller-sized Hankel matrix 
$\mathrm{Hank}(\check{z}_{1:T}, q)$ provided $q > d/\check{M}$. Note that when $q = 1$, we do not enforce the 
constraints provided by the state transition model, thereby simply reducing the LDS 
to a linear system. For $q > 1$, we enforce the state transition model over $q$ successive 
time instants; i.e., we enforce 

    $$x_t = A x_{t-1} = A^2 x_{t-2} = \cdots = A^{q-1} x_{t-q+1}, \qquad q \le t \le T.$$

Larger values of $q$ lead to smoother state sequences, since the estimates conform to 
the state transition model for longer durations. 

We next study the observability properties of specific classes of interesting LDSs 
and the conditions on $\check{\Phi}$ under which the observability of $(\check{\Phi} C, A)$ holds. 

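4.2. Case: $\check{M} = 1$. A particularly interesting scenario is when we obtain exactly 
one common measurement for each video frame. For such a scenario, $\check{M} = 1$ and, 
hence, the measurement matrix can be written as a row vector: $\check{\Phi} = \phi^T \in \mathbb{R}^{1 \times N}$. We 
now establish conditions under which the observability matrix $\mathcal{O}(\phi^T C, A)$ is full rank for this 
particular scenario. Let $c = (\phi^T C)^T = C^T \phi$ and $B = A^T$. We seek a condition under which 
the observability matrix or, equivalently, its transpose 

    $$(\mathcal{O}(c^T, B^T))^T = \begin{bmatrix} c & Bc & B^2 c & \cdots & B^{d-1} c \end{bmatrix} \qquad (8)$$

is full rank.³ We concentrate on the specific scenario where the matrix $B$ (and hence, 
$A$) is diagonalizable, i.e., $B = Q \Lambda Q^{-1}$, where $Q \in \mathbb{R}^{d \times d}$ is an invertible matrix (hence, 
full rank) and $\Lambda$ is a diagonal matrix with diagonal elements $\{\lambda_i, 1 \le i \le d\}$. For 
such matrices, the transpose of the observability matrix can be written as 

    $$(\mathcal{O}(c^T, B^T))^T = \begin{bmatrix} c & Bc & B^2 c & \cdots & B^{d-1} c \end{bmatrix}$$
    $$= \begin{bmatrix} Q Q^{-1} c & Q \Lambda Q^{-1} c & \cdots & Q \Lambda^{d-1} Q^{-1} c \end{bmatrix}$$
    $$= Q \begin{bmatrix} e & \Lambda e & \Lambda^2 e & \cdots & \Lambda^{d-1} e \end{bmatrix},$$

where $e = Q^{-1} c$. This can be expanded as 

    $$Q \begin{bmatrix} e_1 & \lambda_1 e_1 & \lambda_1^2 e_1 & \cdots & \lambda_1^{d-1} e_1 \\ e_2 & \lambda_2 e_2 & \lambda_2^2 e_2 & \cdots & \lambda_2^{d-1} e_2 \\ \vdots & \vdots & \vdots & & \vdots \\ e_d & \lambda_d e_d & \lambda_d^2 e_d & \cdots & \lambda_d^{d-1} e_d \end{bmatrix}$$

and further into 

    $$Q \begin{bmatrix} e_1 & & & \\ & e_2 & & \\ & & \ddots & \\ & & & e_d \end{bmatrix} \begin{bmatrix} 1 & \lambda_1 & \lambda_1^2 & \cdots & \lambda_1^{d-1} \\ 1 & \lambda_2 & \lambda_2^2 & \cdots & \lambda_2^{d-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & \lambda_d & \lambda_d^2 & \cdots & \lambda_d^{d-1} \end{bmatrix}.$$
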

We can establish a sufficient condition for when the observability matrix is full 
rank. 

Theorem 4.4. Let $\check{M} = 1$ and let the elements of $\check{\Phi} = \phi^T$ be i.i.d. from a 
sub-Gaussian distribution. Then, with high probability, the observability matrix is 
full rank when the state transition matrix is diagonalizable and its eigenvalues are 
distinct. 

Proof. From the discussion above, the observability matrix can be written as a 
product of three square matrices: $Q$, the matrix of eigenvectors of $A^T$; a diagonal 
matrix with entries defined by the vector $e = Q^{-1} C^T \phi$; and a Vandermonde matrix 
defined by the vector of eigenvalues of $A$, $\{\lambda_i, 1 \le i \le d\}$. $Q$ is full rank since $A$ is 
diagonalizable, and the Vandermonde matrix is full rank when the eigenvalues are distinct. 
Given that the elements of $\phi$ are i.i.d., the probability that $e_i = 0$ is negligible and, hence, the 
diagonal matrix is full rank with high probability. Since the product of full rank 
square matrices is full rank as well, this implies that the observability matrix is full 
rank with high probability. □ 
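
The three-factor decomposition used in the proof can be verified numerically. In the sketch below, the eigenvalues, the random eigenvector matrix, and the dimensions are assumptions chosen for illustration; the last lines show how a repeated eigenvalue makes the Vandermonde factor rank deficient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 4
lam = np.array([0.9, 0.7, -0.5, 0.3])          # distinct eigenvalues (assumed)
Q = rng.standard_normal((d, d))                # eigenvectors of B = A^T (random stand-in)
B = Q @ np.diag(lam) @ np.linalg.inv(Q)
C = rng.standard_normal((N, d))
phi = rng.standard_normal(N)                   # single measurement vector
c = C.T @ phi

# Transposed observability matrix [c, Bc, ..., B^{d-1} c].
K = np.column_stack([np.linalg.matrix_power(B, k) @ c for k in range(d)])

# Three factors: eigenvectors, diag(e) with e = Q^{-1} c, and a Vandermonde matrix.
e = np.linalg.solve(Q, c)
V = np.vander(lam, d, increasing=True)         # row i is [1, lam_i, ..., lam_i^{d-1}]
factorization_ok = np.allclose(K, Q @ np.diag(e) @ V)
rank_K = int(np.linalg.matrix_rank(K))

# With a repeated eigenvalue the Vandermonde factor loses rank.
V_repeated = np.vander(np.array([0.9, 0.9, -0.5, 0.3]), d, increasing=True)
rank_repeated = int(np.linalg.matrix_rank(V_repeated, tol=1e-8))
print(factorization_ok, rank_K, rank_repeated)
```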

Theorem 4.4 is intriguing, since it guarantees recovery of the state sequence even 
when we obtain only one common measurement per time instant. This is immensely 
useful in reducing the number of measurements required to sense a video sequence. 

Interestingly, we can reduce $\check{M}$ even further. This is achieved by not obtaining 
common measurements at some time instants. 

3 There is an interesting connection to Krylov-subspace methods here. In Krylov-subspace 
methods, a low-rank approximation to a matrix $K$ is obtained by forming the matrix $[c \;\; Kc \;\; K^2 c \;\; \cdots]$ 
with $c$ randomly chosen. Convergence proofs for this method are closely related to Theorem 4.4. To 
the best of our knowledge, diagonalizability of $K$ plays an important role in most of these proofs. 
The interested reader is referred to [31] for more details. 

4.3. Missing measurements: Case $\check{M} < 1$. If we do not obtain common 
measurements at some time instants, then is it still possible to obtain an estimate 
of the state sequence? One way to view this problem is that we have incomplete 
knowledge of the Hankel matrix defined in (3) and we seek to complete this matrix. 
Matrix completion, especially for low-rank matrices, has received significant attention 
recently [6, 8, 29]. 

Given that the Hankel matrix $\mathrm{Hank}(\check{z}_{1:T}, q)$ in (3) is low rank for videos modeled 
as LDSs, we formulate the missing measurement recovery problem as one of matrix 
completion. Suppose that we have the common measurements only at time instants 
given by the index set $\mathcal{I} \subset \{1, \ldots, T\}$, i.e., we have knowledge of $\{\check{z}_i, i \in \mathcal{I}\}$. 
We can recover the missing measurements by exploiting the low-rank property of 
$\mathrm{Hank}(\check{z}_{1:T}, q)$. Specifically, we solve the following problem to obtain the missing 
measurements: 

    $$\min \mathrm{rank}(\mathrm{Hank}(g_{1:T}, q)) \quad \text{s.t.} \quad g_i = \check{z}_i,\ i \in \mathcal{I}.$$

However, $\mathrm{rank}(\cdot)$ is a non-convex function, which renders the above problem NP-hard. 
In practice, we can solve a convex relaxation of this problem,⁴ 

    $$\min \|\mathrm{Hank}(g_{1:T}, q)\|_* \quad \text{s.t.} \quad g_i = \check{z}_i,\ i \in \mathcal{I}, \qquad (9)$$

where $\|H\|_*$ is the nuclear norm of the matrix $H$, which equals the sum of its singular 
values. Once we fill in the missing measurements, we use (7) to recover an estimate 
of the state sequence. 
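
The sketch below illustrates the premise behind this relaxation rather than solving (9) itself: for a noiseless LDS observed through a single common measurement per frame, the Hankel matrix has rank equal to the state dimension, and the objective in (9) is simply the sum of its singular values. The system, the sizes, and the zero-filling of missing samples are assumptions made for illustration; an actual completion would require a convex solver.

```python
import numpy as np

def hankel(z, q):
    """q x (T - q + 1) Hankel matrix of a scalar sequence z of length T."""
    T = z.size
    return np.vstack([z[i:T - q + 1 + i] for i in range(q)])

rng = np.random.default_rng(0)
d, T, q = 3, 60, 8
A = 0.98 * np.linalg.qr(rng.standard_normal((d, d)))[0]
c = rng.standard_normal(d)               # stands in for the row vector phi^T C
x = rng.standard_normal(d)
z = np.empty(T)
for t in range(T):
    z[t] = c @ x                         # one common measurement per frame
    x = A @ x

H = hankel(z, q)
s = np.linalg.svd(H, compute_uv=False)
numerical_rank = int(np.sum(s > 1e-8 * s[0]))
nuclear_norm = float(np.linalg.norm(H, ord="nuc"))   # objective of the relaxation

# Zero-filling missing samples (instead of completing them) destroys the low rank.
observed = rng.random(T) < 0.7
s_filled = np.linalg.svd(hankel(np.where(observed, z, 0.0), q), compute_uv=False)
print(numerical_rank, nuclear_norm, int(np.sum(s_filled > 1e-8 * s_filled[0])))
```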

An important quantity to characterize is the proportion of time instants where 
we can choose to not obtain common measurements. This amounts to developing a 
sampling theorem for the completion of low-rank Hankel matrices; to the best of our 
knowledge, there has been little theoretical work on this problem. Instead, we address 
it empirically in Section 6. 

5. Estimating the observation matrix. In this section, we discuss estimation 
of the observation matrix $C$ given the estimates of the state sequence $\hat{x}_{1:T}$. 

5.1. Need for innovation measurements. Given estimates of the state 
sequence $\hat{x}_{1:T}$, the compressive measurements are linear in the matrix $C$, which enables 
a host of conventional $\ell_2$-based methods as well as $\ell_1$-based recovery algorithms to 
estimate $C$. However, recall that $C$ is an $N \times d$ matrix and, hence, the common 
measurements by themselves are not enough to recover $C$, unless $\check{M}$ is large. 

The common measurements $\check{z}_{1:T}$ used in the estimation of the state sequence are 
measured using a time-invariant measurement matrix $\check{\Phi}$. A time-invariant 
measurement matrix, by itself, is not sufficient for estimating $C$ unless $\check{M}$ is very large. To 
alleviate this problem, we take additional compressive measurements of each frame 
using a time-varying measurement matrix. Let $\tilde{z}_t = \tilde{\Phi}_t y_t + \tilde{\omega}_t = \tilde{\Phi}_t C x_t + \tilde{\omega}'_t$, where 
$\tilde{z}_t \in \mathbb{R}^{\tilde{M}}$ and $\tilde{\Phi}_t \in \mathbb{R}^{\tilde{M} \times N}$ are the compressive measurements and the corresponding 
measurement matrix at time $t$, $\tilde{\omega}_t$ is the measurement noise, and $\tilde{\omega}'_t$ additionally 
absorbs the observation noise of the LDS. As mentioned earlier in Section 3, we refer to these 
as innovation measurements. Noting that $C$ is a time-invariant parameter, we can 
collect innovation measurements over a period of time before reconstructing $C$. This 
enables a significant reduction in the number of measurements taken at each time 
instant. 
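
To illustrate why time-varying matrices help, the following sketch recovers the observation matrix by plain least squares (the $\ell_2$ route, with no sparsity prior) from measurements accumulated over time. It relies on the identity $\mathrm{vec}(\Phi C x) = (x^T \otimes \Phi)\,\mathrm{vec}(C)$ with column-major vectorization. The sizes are hypothetical and the states are assumed to be known exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, T, M_innov = 30, 3, 40, 4          # M_innov * T = 160 equations, N * d = 90 unknowns
C = rng.standard_normal((N, d))
X = rng.standard_normal((d, T))          # stand-in for the estimated state sequence

def stacked_system(Phis):
    """Stack (x_t^T kron Phi_t) and z_t over time; the unknown is vec(C), column-major."""
    G = np.vstack([np.kron(X[:, t][None, :], Phis[t]) for t in range(T)])
    z = np.concatenate([Phis[t] @ C @ X[:, t] for t in range(T)])
    return G, z

# Time-varying measurement matrices: the accumulated system determines C.
Phis_varying = [rng.standard_normal((M_innov, N)) for _ in range(T)]
G, z = stacked_system(Phis_varying)
vec_C, *_ = np.linalg.lstsq(G, z, rcond=None)
C_hat = vec_C.reshape((N, d), order="F")

# One fixed matrix reused at every instant: rank is capped at M_innov * d.
Phi_fixed = rng.standard_normal((M_innov, N))
G_fixed, _ = stacked_system([Phi_fixed] * T)
rank_fixed = int(np.linalg.matrix_rank(G_fixed, tol=1e-8))
print(np.allclose(C_hat, C), rank_fixed)
```

With a fixed matrix, no amount of accumulation over time increases the rank beyond `M_innov * d`, which mirrors the statement that a time-invariant measurement matrix alone is not sufficient.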



4 Historically, the use of nuclear norm-based optimization for system identification goes back to 
Fazel et al. [18, 19]. Since then, there has been much work towards establishing the equivalence of 
these two problems [6, 29]. 

5.2. Structured sparsity for C. We employ sparse priors in the recovery of 
the observation matrix $C$. For a large class of videos, the columns of $C$ represent 
the dominant motion in the scene; when motion in the scene is spatially correlated, 
the columns of $C$ are compressible in a wavelet/DCT basis. Hence, we can potentially 
obtain an estimate of $C$ by solving the following convex program: 

    $$(P_{\ell_1}) \quad \min \sum_{i=1}^{d} \|\Psi^T c_i\|_1 \quad \text{s.t.} \quad \forall t,\ \|\tilde{z}_t - \tilde{\Phi}_t C \hat{x}_t\|_2 \le \epsilon. \qquad (10)$$

Here, we denote the columns of the matrix $C$ as $c_i$, $i = 1, \ldots, d$. $\Psi$ is a sparsifying basis 
for the columns of $C$; we have the freedom to choose different sparsifying bases for 
different columns of $C$. In addition to this, we can use dictionary learning algorithms 
[23] to learn an appropriate basis wherein the columns of $C$ are sparse/compressible. 
When the columns of $C$ are not sparse, the $\ell_1$-prior fails. For such systems, we can 
revert to $\ell_2$-based methods to recover $C$; in such cases, we would typically need more 
measurements to recover $C$. 

However, the convex program $(P_{\ell_1})$ is not sufficient as-is to recover $C$. The reason 
for this stems from ambiguities in the definition of the LDS (see Section 2.3). The 
use of SVD for recovering the state sequence introduces an ambiguity in the estimates 
of the state sequence in the form of $[\hat{x}_{1:T}] \approx L^{-1}[x_{1:T}]$, where $L$ is an invertible 
$d \times d$ matrix. As a consequence, this will lead to an estimate $\hat{C} = CL$ satisfying 
$\tilde{z}_t = \tilde{\Phi}_t C x_t = \tilde{\Phi}_t (CL)(L^{-1} x_t) = \tilde{\Phi}_t \hat{C} \hat{x}_t$. Suppose the columns of $C$ are $K$-sparse 
(equivalently, compressible for a certain value of $K$) each in $\Psi$ with support $S_k$ for 
the $k$-th column. Then, the columns of $CL$ are potentially $dK$-sparse with identical 
supports $S = \bigcup_k S_k$. The support is exactly $dK$-sparse when the $S_k$ are disjoint and 
$L$ is dense. At first glance, this seems to be a significant drawback, since the overall 
number of non-zero coefficients of $\hat{C}$ has increased to $d^2 K$ (for $C$ it is $dK$). However, this apparent 
loss of sparsity is alleviated by the columns having identical supports, which can 
be exploited in the recovery process. 
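
The support argument can be illustrated with a small example. The coefficient matrix, the disjoint supports, and the dense mixing matrix below are constructed by hand for illustration, with positive entries so that no accidental cancellation occurs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, K = 40, 3, 4

# Coefficient matrix whose columns are K-sparse with disjoint supports.
S = np.zeros((N, d))
for k in range(d):
    S[k * K:(k + 1) * K, k] = rng.uniform(1.0, 2.0, K)

# Dense mixing matrix; strictly diagonally dominant, hence invertible.
L = rng.uniform(1.0, 2.0, (d, d)) + 3.0 * np.eye(d)
S_mixed = S @ L

nonzeros_before = int(np.count_nonzero(S))                       # d * K
nonzeros_after = int(np.count_nonzero(S_mixed))                  # d^2 * K
shared_rows = int(np.count_nonzero(np.any(S_mixed != 0, axis=1)))  # size of the union support
print(nonzeros_before, nonzeros_after, shared_rows)
```

Although the number of non-zero entries grows by a factor of $d$, they all fall in the same $dK$ rows, which is the row-sparse (group-sparse) structure that the recovery can exploit.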

Given the estimates $\hat{x}_{1:T}$, we estimate the matrix $C$ by solving the following 
convex program: 

    $$(P_{\ell_2\text{-}\ell_1}) \quad \min \sum_{i=1}^{N} \|s_i\|_2 \quad \text{s.t.} \quad C = \Psi S,\ \forall t,\ \|\tilde{z}_t - \tilde{\Phi}_t C \hat{x}_t\|_2 \le \epsilon, \qquad (11)$$

where $s_i$ is the $i$-th row of the matrix $S = \Psi^T C$ and $\Psi$ is a sparsifying basis for the 
columns of $C$. The above problem is an instance of an $\ell_2$-$\ell_1$ mixed-norm optimization 
that promotes group sparsity; in this instance, we use it to promote group column 
sparsity in the matrix $S$, i.e., all columns have the same sparsity pattern. 

There are multiple efficient ways to solve $(P_{\ell_2\text{-}\ell_1})$, including solvers such as
SPG-L1 [36] and model-based CoSAMP [3]. Algorithm 1 summarizes a model-based
CoSAMP algorithm used for recovering the observation matrix $C$. The specific model
used here is a union-of-subspaces model that groups each row of $S = \Psi^T C$ into a single
subspace/model. This greedy solution offers a computationally efficient alternative to
the convex program $(P_{\ell_2\text{-}\ell_1})$ at a small price in the accuracy of the result. In
addition, in many applications, the parameters associated with the CoSAMP algorithm
are far more intuitive. Specifically, the only parameter required in Algorithm 1 is the
sparsity $K$, i.e., the expected number of non-zeros in each column of $S = \Psi^T C$.
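
The row-wise support selection that such a model-based greedy solver relies on can be sketched as
follows. This is only an illustration of selecting and pruning rows by their energy, not the full
recovery algorithm; the helper names `row_energy`, `supp`, and `prune_rows` are hypothetical.

```python
def row_energy(R):
    """Energy (sum of squares) of each row of R, where R is a list of rows."""
    return [sum(v * v for v in row) for row in R]


def supp(vec, K):
    """Indices of the K largest entries of vec (ties broken by lower index)."""
    order = sorted(range(len(vec)), key=lambda k: (-vec[k], k))
    return set(order[:K])


def prune_rows(B, K):
    """Keep the K rows of B with the largest energy and zero out all others."""
    keep = supp(row_energy(B), K)
    return [list(row) if i in keep else [0.0] * len(row) for i, row in enumerate(B)]


# Hypothetical 4 x 2 matrix; rows 1 and 3 carry most of the energy.
B = [[0.1, 0.1],
     [3.0, 4.0],
     [0.0, 0.0],
     [1.0, 1.0]]

selected = supp(row_energy(B), 2)
pruned = prune_rows(B, 2)
```

Because whole rows are kept or discarded together, every column of the pruned matrix has the same
support, which matches the group structure described above.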

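Algorithm 1: $\hat{C}$ = Model-based CoSAMP($\Psi$, $K$, $z_t$, $\hat{x}_t$, $\Phi_t$, $t = 1, \ldots, T$)
  Notation:
    supp(vec; $K$) returns the support of the $K$ largest elements of vec.
    $A|_{\Omega,\cdot}$ represents the submatrix of $A$ with rows indexed by $\Omega$ and all columns.
    $A|_{\cdot,\Omega}$ represents the submatrix of $A$ with columns indexed by $\Omega$ and all rows.
  Initialization:
    $\forall t,\ \theta_t \leftarrow \Phi_t \Psi$
    $\forall t,\ v_t \leftarrow 0 \in \mathbb{R}^{\widetilde{M}}$
  while (stopping conditions are not met) do
    Compute signal proxy:
      $R = \sum_t \theta_t^T v_t \hat{x}_t^T$
    Compute energy in each row:
      $k \in [1, \ldots, N],\ r(k) = \sum_{i=1}^{d} R^2(k, i)$
    Support identification and merger:
      $\Omega \leftarrow \Omega_{\text{old}} \cup \text{supp}(r; 2K)$
    Least squares estimation:
      Find $A \in \mathbb{R}^{|\Omega| \times d}$ that minimizes $\sum_t \|z_t - (\theta_t)|_{\cdot,\Omega} A \hat{x}_t\|_2$
      $B|_{\Omega,\cdot} \leftarrow A,\ B|_{\Omega^c,\cdot} \leftarrow 0$
    Pruning support:
      $k \in [1, \ldots, N],\ b(k) = \sum_{i=1}^{d} B^2(k, i)$
      $\Omega \leftarrow \text{supp}(b; K),\ S|_{\Omega,\cdot} \leftarrow B|_{\Omega,\cdot},\ S|_{\Omega^c,\cdot} \leftarrow 0$
    Form new estimate of $C$:
      $\hat{C} \leftarrow \Psi S$
    Update residue:
      $\forall t,\ v_t \leftarrow z_t - \theta_t S \hat{x}_t$
  end

5.3. Value of M. For stable recovery of the observation matrix $C$, we need
in total $O(dK \log(N/K))$ measurements; for a large class of practical solvers, a rule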
of thumb is $4dK \log(N/K)$. Given that we measure $\widetilde{M}$ time-varying compressive
measurements at each time instant, over a period of $T$ time instants, we have $\widetilde{M}T$
compressive measurements for estimating $C$. Hence, for stable recovery of $C$, we need
approximately

$$\widetilde{M} T = 4dK \log(N/K) \implies \widetilde{M} = 4\,\frac{dK}{T} \log(N/K). \tag{12}$$

This indicates extremely favorable operating scenarios for the CS-LDS framework,
especially when $T$ is large (as in high frame rate capture). Let $T = \tau f_s$, where $\tau$ is the
time duration of the video in seconds and $f_s$ is the sampling rate of the measurement
device. The number of compressive measurements required in this case is
$\widetilde{M} = 4dK \log(N/K) / (\tau f_s)$. Given that the complexity of the LDS typically
(however, not always) depends on $\tau$, for a fixed $\tau$ the number of measurements required to
estimate $C$ decreases as $1/f_s$ as the sampling rate $f_s$ is increased. Indeed, as the sampling
rate $f_s$ increases, $\widetilde{M}$ can be decreased while keeping $\widetilde{M} f_s$ constant.
This will ensure that (12) is satisfied, enabling stable recovery of $C$.
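
The scaling in (12) can be checked numerically with a short sketch. It assumes the natural
logarithm (the text does not state the base), and the function name and the one-second duration
are hypothetical; the values $d = 20$, $K = 30$, and $N = 128 \times 128$ are borrowed from the
fire-texture experiment of Section 6.2 purely as an example.

```python
import math


def innovation_measurements_per_frame(d, K, N, T):
    """Rule of thumb from (12): 4 * d * K * log(N / K) / T, natural log assumed."""
    return 4.0 * d * K * math.log(N / K) / T


d, K, N = 20, 30, 128 * 128
tau = 1.0                       # hypothetical video duration in seconds

# Doubling the sampling rate f_s doubles T = tau * f_s for a fixed duration.
m_slow = innovation_measurements_per_frame(d, K, N, T=tau * 250)
m_fast = innovation_measurements_per_frame(d, K, N, T=tau * 500)
```

Under these assumptions `m_fast` is half of `m_slow`, which is the $1/f_s$ behavior described
above: the product of measurements per frame and frame rate stays constant.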
5.4. Mean + LDS. In many instances, a dynamical scene is modeled better as
an LDS over a static background, that is, $y_t = Cx_t + \mu$. This can be handled with
two small modifications to the algorithm described in Section 5.2. First, the state
sequence $[\hat{x}_{1:T}]$ is obtained by performing an SVD on the matrix
$\mathrm{Hank}(z_{1:T}, d_{\text{guess}})$ modified such that each row sums to zero. This works
under the assumption that the sample mean of $z_{1:T}$ is equal to $\Phi\mu$, the compressive
measurement of $\mu$. Second, given that the support of $\mu$ need not be similar to that of $C$,
the ensuing convex program can be reformulated as

$$\min \|\Psi^T \mu\|_1 + \sum_{i=1}^{N} \|s_i\|_2 \quad \text{s.t.} \quad C = \Psi S,\ \ \forall t,\ \|z_t - \Phi_t(\mu + Cx_t)\|_2 \le \epsilon. \tag{13}$$

As with the convex formulation, the model-based CoSAMP algorithm described in
Algorithm 1 can be modified to incorporate the mean term $\mu$; an additional modification
here is the requirement to specify a priori the sparsity of the mean, $\|\Psi^T \mu\|_0$.
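
The first modification, removing the mean from the Hankel matrix before the SVD, can be sketched
as below. The block Hankel layout used here (block row $i$, column $j$ holds the measurement at
time $i + j$) is an assumption about the definition of $\mathrm{Hank}(\cdot)$ given earlier in the
paper, and the function names and toy measurements are hypothetical.

```python
def block_hankel(z, d):
    """Block Hankel matrix with d block rows built from a list of T measurement vectors.

    Block row i, column j holds z[i + j]; the result has d * M rows and
    T - d + 1 columns, where M is the length of each measurement vector.
    """
    T, M = len(z), len(z[0])
    cols = T - d + 1
    rows = []
    for i in range(d):
        for m in range(M):
            rows.append([z[i + j][m] for j in range(cols)])
    return rows


def center_rows(H):
    """Subtract each row's mean so that every row sums to zero."""
    centered = []
    for row in H:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    return centered


# Hypothetical measurements: T = 4 time instants, M = 2 measurements each.
z = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
H = block_hankel(z, d=2)
H_centered = center_rows(H)
```

The SVD for the state sequence would then be applied to `H_centered` rather than `H`.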

6. Experiments. We present a range of experiments validating various aspects
of the CS-LDS framework. We use permuted noiselets [13] for the measurement
matrices, since they have a fast scalable implementation. We use the term compression
ratio $N/M$ to denote the reduction in the number of measurements as compared to the
Nyquist rate. Finally, we use the reconstruction SNR to evaluate the recovered videos.
Given the ground truth video $y_{1:T}$ and a reconstruction $\hat{y}_{1:T}$, the reconstruction
SNR in dB is defined by

$$10 \log_{10} \left( \frac{\sum_{t=1}^{T} \|y_t\|_2^2}{\sum_{t=1}^{T} \|y_t - \hat{y}_t\|_2^2} \right).$$
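
A minimal sketch of the reconstruction SNR computation follows, with each video stored as a list
of frames and each frame as a flat list of pixel values. The function name and the toy data are
hypothetical, and the sketch assumes the reconstruction is not exact (nonzero error energy).

```python
import math


def reconstruction_snr_db(y, y_hat):
    """Reconstruction SNR in dB: total signal energy over total error energy."""
    signal = sum(v * v for frame in y for v in frame)
    error = sum((a - b) ** 2
                for frame, frame_hat in zip(y, y_hat)
                for a, b in zip(frame, frame_hat))
    return 10.0 * math.log10(signal / error)


# Toy example: signal energy 25, error energy 0.25, i.e., a ratio of 100.
y = [[3.0, 4.0], [0.0, 0.0]]
y_hat = [[3.0, 3.5], [0.0, 0.0]]
snr = reconstruction_snr_db(y, y_hat)
```

For this toy input the energy ratio is 100, which corresponds to 20 dB.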



We compare CS-LDS against frame-by-frame CS, where each frame of the video is
recovered separately using conventional CS techniques. We use the term Oracle LDS
when the parameters and video reconstruction are obtained by operating on the
original data itself. Oracle LDS estimates the parameters using a rank-$d$ approximation
of the ground truth data. The reconstruction SNR of the Oracle LDS gives an upper
bound on the achievable SNR. Finally, the ambiguity in the observation matrix (due
to non-uniqueness of the SVD-based factorization) as estimated by Oracle LDS and
CS-LDS is resolved by finding the best $d \times d$ linear transformation that registers the
two estimates.

6.1. State sequence estimation. We first provide empirical verification of the
results derived in Sections 4.1 and 4.2. It is worth noting that, in the absence of noise,
Theorem 4.4 suggests exact recovery of the state sequence. In practice, it is important
to check the robustness of the estimate to measurement noise. Figure 2(a) analyzes
the performance of the state space estimation for different values of the number of
common measurements $\check{M}$ and different SNRs of the measurement noise. We define
input SNR in dB as $10 \log_{10}\big(\sum_t \|y_t\|_2^2 / (T\sigma^2)\big)$, where $\sigma$ is the standard deviation
of the noise. Here, we consider the scenario when $\check{M} \ge 1$. The underlying state
space dimension is $d = 10$ with $T = 500$ frames. As expected, for high input SNRs (low
noise levels), the reconstruction SNR is very high even for small values of $\check{M}$. In
addition, the accuracy at $\check{M} = 1$ is acceptable, especially at low noise levels.

Next, we validate the implications of Section 4.3, where we discuss the scenario of
$\check{M} < 1$, by simulating various proportions of missing common measurements. Figure
2(b) shows the reconstruction SNR for the Hankel matrix for varying amounts of
missing measurements. We recover the Hankel matrix by solving (9) using CVX [20].
Figure 2(b) demonstrates a very high reconstruction SNR even at a very high rate of
missing measurements. As mentioned earlier, not having to sense common measurements
at all frames is very useful, since we can stagger our acquisition of common and
innovation measurements. In theory, this enables a measurement strategy where we
need to sense only one measurement per frame of the video without having to group
consecutive measurements of the SPC. Hence, we can aim to reconstruct videos at
the sampling rate of the SPC. To the best of our knowledge, this is the first video CS
acquisition design capable of doing this.

Fig. 2. Accuracy of state sequence estimation from common measurements. Shown are
aggregate results over 100 Monte-Carlo runs for an LDS with $d = 10$ and $T = 500$. For each
Monte-Carlo run, the system matrices and the state sequence were generated randomly. (a)
Reconstruction SNR as a function of the number of common measurements $\check{M}$ per frame.
Each curve is for a different level of measurement noise as measured using input SNR. For low
noise levels, we obtain a good reconstruction SNR (> 20 dB) even at $\check{M} = 1$; this hints at
very high compression ratios. (b) Reconstruction SNR of the Hankel matrix for the scenario with
missing common measurements. We can estimate the Hankel matrix very accurately even at 80%
missing measurements. This suggests immense flexibility in the implementation of the CS-LDS
system.

6.2. Dynamic Textures. Our test dataset comprises videos from the DynTex
dataset [28]. We used the mean+LDS model from Section 5.4 for all the video CS
experiments with the 2D DCT as the sparsifying basis for the columns of $C$ and 2D
wavelets as the sparsifying basis for the mean. We used the model-based CoSAMP
solver in Algorithm 1 for these results, since it provides explicit control of the sparsity
of the mean and the columns of $C$. We used (12) as a guide to select these values.

Figure 3 shows video reconstruction of a dynamic texture from the DynTex
dataset [28]. Reconstruction results are under a compression ratio $N/M = 234$; this is an
operating point where a frame-to-frame CS recovery is completely infeasible. However,
the dynamic component of the scene is relatively small ($d = 20$), which allows
us to recover the video from relatively few measurements. The reconstruction SNRs
of the recovered videos are as follows: Oracle LDS = 24.97 dB, frame-to-frame
CS = 11.75 dB, and CS-LDS = 22.08 dB.

Figure 4 shows the reconstruction of a video of 6 blinking LED lights from the
DynTex dataset. We show reconstruction results at different compression ratios as
well as different image resolutions. It is noteworthy that, even at a $100\times$ compression,
the reconstruction at a resolution of $256 \times 256$ pixels preserves fine details.

Fig. 3. Reconstruction of a fire texture of length 250 frames and resolution of $N = 128 \times 128$
pixels. (a-d) Sampling of frames of the (a) ground truth video, (b) Oracle LDS reconstruction, (c)
CS-LDS reconstruction, and (d) naive frame-to-frame CS reconstruction. The CS-LDS
reconstruction closely resembles the Oracle LDS result. For the CS-LDS results, compressive
measurements were obtained at $\check{M} = 30$ and $\widetilde{M} = 40$ measurements per frame,
thereby giving a compression ratio of $234\times$. Reconstruction was performed with $d = 20$ and
$K = 30$. (e) Ground truth observation matrix $C$. (f) CS-LDS estimate of the observation matrix
$\hat{C}$. In (e) and (f), each column of the observation matrix is visualized as an image. Both the
frames of the videos and the observation matrices are shown in false color for better contrast.

Fig. 4. Reconstruction of a video comprising 6 blinking LED lights. We used $d = 7$,
$\check{M} = 3d$, and $\widetilde{M}$ chosen based on the overall compression ratio
$N/(\check{M} + \widetilde{M})$. Each row shows a sampling of frames of the video reconstructed at a
different compression ratio. Inset in each row is the resolution of the video used as well as the
compression at sensing and the reconstruction SNR. While performance degrades with increasing
compression, it also improves significantly for higher-dimensional data; the reconstruction at
$256 \times 256$ pixels preserves finer details.



Performance with measurement noise: We validate the performance of our recovery
algorithm under various amounts of measurement noise. Note that the columns
of $C$ with larger singular values are, inherently, better conditioned to deal with this
measurement error. The columns corresponding to the smaller singular values are
invariably estimated with higher error. Figure 5 shows the performance of the recovery
algorithm for various levels of measurement noise. The effect of the measurement
noise on the reconstructions is perceived only at low input SNRs. In part, this
robustness to measurement noise is due to the LDS model mismatch dominating the
reconstruction error at high input SNRs. As the input SNR drops significantly below
the model mismatch term, the measurement noise predictably starts influencing the
reconstructions more. This provides a certain amount of flexibility in the design of
potential CS-LDS cameras.

Fig. 5. Resilience of the CS-LDS framework to measurement noise. (a) Performance plot
charting the reconstruction SNR as a function of compression ratio $N/M$. Each curve is for a
different level of measurement noise as measured using input SNR. Reconstruction SNRs were
computed using 32 Monte-Carlo simulations. The "black-dotted" line shows the reconstruction SNR
for a $d = 20$ Oracle LDS. (b-d) Snapshots of video frames at various operating points: (c)
$N/M = 526$, input SNR = 10 dB; (d) $N/M = 526$, input SNR = 30 dB. The dynamic texture of
Fig. 3 was used for this result.

Gallery of results: Finally, in Figure 6, we demonstrate the performance of the CS-LDS
methodology for sensing and reconstructing a wide range of videos. The reader is
directed to the supplemental material as well as the project webpage for animated
videos of these results.



6.3. Hyperspectral imaging using CS-LDS. The CS-LDS framework can
be applied to any data that is subspace compressible; in this regard, it can be used to
sense data other than video. One such example is sensing hyperspectral data
using CS-LDS; in contrast to video, where we consider image variations with time,
hyperspectral data involves imaging across spectral bands. Hyperspectral data has
been shown to lie on subspaces [10]; this occurs due to the alignment of texture edges
across spectral bands. One hardware implementation of CS-LDS for hyperspectral
data involves a color wheel in front of an SPC with a broadband sensor. If the optical
filters on the color wheel have a narrow passband, then we can isolate spectral bands
and treat them much like the frames of a video. An alternate implementation can be
obtained by using a spectrometer along with a micromirror device.

Fig. 6. A gallery of reconstruction results using the CS-LDS framework. Each sub-figure (a-i)
shows reconstruction results for a different video. The three rows of each sub-figure correspond to,
from top to bottom, the ground truth video and CS-LDS reconstructions at compression ratios of
$20\times$ and $50\times$. Each column is a frame of the video and its reconstruction. Inset on each
reconstruction is the reconstruction SNR for that result. All videos are from the DynTex dataset
[28], downsampled to a spatial resolution of $128 \times 128$ pixels. The "code" in quotes refers to
the name of the sequence in the database. For all videos, $d = 15$ and $\check{M} = 3d$. The
interested reader is directed to the project webpage and the supplemental material for videos of
these results.

Hyperspectral data do not necessarily exhibit smoothness across spectral bands.
Hence, we model the spectral bands as linear systems without any dynamics across the
spectral bands (and without the additional term for the mean vector). We used the
convex formulation in (11) for the hyperspectral reconstructions. Figure 7 showcases
reconstruction results for a hyperspectral data cube along with comparisons with
naive CS (where each spectral band is reconstructed separately) as well as
group-sparse CS, where a group sparsity prior is used across spectral bands. We see
that CS-LDS outperforms both of the straw-man algorithms. As discussed in Section
5.3, CS-LDS achieves higher compression ratios when applied to longer sequences.
For hyperspectral data, this corresponds to having a large number of spectral bands.
Figure 8 showcases reconstruction results on a hyperspectral cube from the AIRIBRAD
dataset [1]. The data consists of 2301 spectral bands, each with a spatial resolution
of $128 \times 64$ pixels. The reconstructions obtained using CS-LDS are stable over a large
range of compression ratios as well as parameter values.



6.4. Application in activity analysis. As mentioned in Section 2.3, LDSs
are often used in classification problems, especially in the context of scene/activity
analysis. A key experiment in this context is to check if the CS-LDS framework
recovers videos that are sufficiently informative for such applications. To this end, we
experiment with two different activity analysis datasets: the UCSD Traffic Dataset
[11] and the UMD Human Activity Dataset [40].

Fig. 7. Reconstruction of a hyperspectral datacube with 128 spectral bands; the image at each
spectral band has $128 \times 128$ pixels. (a) Sampling of the spectral bands of the data. (b-d)
Reconstruction results of CS-LDS at two different operating points and a conventional CS algorithm
with group-sparse prior. Inset in each sub-figure is the reconstruction SNR in dB. (e) We compare
performance of various algorithms for varying compression ratios. We used 2D wavelets as the
sparsifying basis.

The UCSD Traffic Dataset [11] consists of 254 videos capturing traffic of three
types: light, moderate, and heavy. Each video is of length 50 frames at a resolution
of $64 \times 64$ pixels. Figure 9 shows the reconstruction results on a traffic sequence from
the dataset. We perform a classification experiment of the videos into these three
categories. There are four different train-test scenarios provided with the dataset.
For comparison, we also perform the same experiments with fitting the LDS model
on the original frames (Oracle LDS). We perform classification at two different values
of the state space dimension $d$ and at a fixed compression ratio of $25\times$.

The UMD Human Activity Dataset [40] consists of 100 videos, each of length 80
frames, depicting 10 different activities: pickup object, jog, push, squat, wave, kick,
bend, throw, turn around, and talk on cellphone. Each activity was repeated 10 times,
so there were a total of 100 sequences in the dataset. As with the traffic experiment,
we use an LDS model on the image intensity values without any feature extraction.
Images were cropped to contain the human and resized to $330 \times 300$. The state space
dimension was fixed at $d = 5$ and the compression was varied from $50\times$ to $200\times$.

Fig. 8. Reconstruction of a hyperspectral datacube with 2301 spectral bands; the image at
each spectral band has $128 \times 64$ pixels. (a) Sampling of three spectral bands of the data.
(b-f) Reconstruction results of CS-LDS with $d = 2$ at various compression ratios. (g-k)
Reconstruction results of CS-LDS with $d = 5$ at various compression ratios. Inset in each
sub-figure is the reconstruction SNR in dB. We used 2D wavelets as the sparsifying basis. We use
a false colormap for enhanced contrast and better visualization of subtle details.

Fig. 9. Reconstructions of a traffic scene of $N = 64 \times 64$ pixels at a compression ratio
$N/M = 25$, with $d = 15$ and $K = 40$. (a, c) Sampling of the frames of the ground truth and
reconstructed video. (b, d) The first ten columns of the observation matrix $C$ and the estimated
matrix $\hat{C}$; both are shown in false color for improved contrast. The quality of reconstruction
and LDS parameters is sufficient for capturing the flow of traffic, as seen in the classification
results in Table 1.

For both datasets, we used the Procrustes distance [12] between the column spaces
of the observability matrices in the design of a nearest-neighbor classifier. Given the
observability matrix $O(C, A)$ defined in (6), let $Q$ be an orthonormal matrix such that
$\mathrm{span}(Q) = \mathrm{span}(O(C, A))$. Given two LDSs, the squared Procrustes distance between
them is given by

$$d^2(Q_1, Q_2) = \min_{R \in \mathbb{R}^{d \times d}} \mathrm{tr}\, (Q_1 - Q_2 R)^T (Q_1 - Q_2 R),$$

where $\mathrm{span}(Q_1) = \mathrm{span}(O(C_1, A_1))$ and $\mathrm{span}(Q_2) = \mathrm{span}(O(C_2, A_2))$.
We use this distance function in a nearest-neighbor classifier in the classification experiment.
We performed a leave-one-execution-out test. The results are summarized in Tables 1 and
2. In both classification experiments, the CS-LDS framework obtained a classification
performance that is comparable to the Oracle LDS. For the UMD Human Activity
Dataset, both Oracle LDS and CS-LDS obtained a perfect classification score of 100%
up to a compression ratio of $50\times$. This suggests that the CS-LDS framework should
be extremely useful in a wide range of applications beyond just video recovery.
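
The squared Procrustes distance defined above, and its use in a nearest-neighbor rule, can be
sketched as follows. The sketch takes the minimization over all $d \times d$ matrices $R$, as
written in the definition; in that case the minimizer is $R = Q_2^T Q_1$ and the distance reduces
to $d - \|Q_2^T Q_1\|_F^2$ for matrices with orthonormal columns. All function names are
hypothetical, and the inputs stand in for bases of the observability column spaces, which are not
constructed here.

```python
import math


def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Multiply two dense matrices given as lists of rows."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def orthonormal_basis(A):
    """Gram-Schmidt on the columns of A (assumed linearly independent)."""
    basis = []
    for col in transpose(A):
        v = list(col)
        for q in basis:
            proj = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - proj * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])
    return transpose(basis)


def procrustes_sq(Q1, Q2):
    """Squared Procrustes distance for Q1, Q2 with d orthonormal columns each.

    With R ranging over all d x d matrices, the minimizer is R = Q2^T Q1,
    which gives d - ||Q2^T Q1||_F^2.
    """
    d = len(Q1[0])
    G = matmul(transpose(Q2), Q1)
    return d - sum(g * g for row in G for g in row)


def nearest_label(Q_test, train):
    """train is a list of (Q, label) pairs; return the label of the closest Q."""
    return min(train, key=lambda item: procrustes_sq(Q_test, item[0]))[1]
```

Two bases of the same subspace are at distance zero under this definition, while two mutually
orthogonal $d$-dimensional subspaces are at squared distance $d$.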

Table 1
Classification results (in %) on the UCSD Traffic Dataset

                          Expt 1   Expt 2   Expt 3   Expt 4
(d = 10)   Oracle LDS     85.71    85.93    87.5     92.06
           CS-LDS         84.12    87.5     89.06    85.71
(d = 5)    Oracle LDS     77.77    82.81    92.18    80.95
           CS-LDS         85.71    73.43    78.1     76.1

Table 2
Classification results (in %) on the UMD Human Activity Database

Activity        100x    150x    200x
...
Kick            100     ...     ...
...
Throw           100     100     90
Turn Around     100     100     100
...
Average         94      90      78

7. Discussion. In this paper, we have proposed a framework for the compressive
acquisition of dynamic scenes modeled as LDSs. In particular, this paper emphasizes
the power of predictive/generative video models. In this regard, we have shown that
a strong model for the scene dynamics enables stable video reconstructions at very
low measurement rates. In particular, it enables the estimation of the state sequence
associated with a video even at a fractional number of common measurements per video
frame ($\check{M} \le 1$). The use of CS-LDS for dynamic scene modeling and classification
also highlights the purposive nature of the framework.



Connection to affine-rank minimization: The pioneering work of Fazel [17]
in developing convex optimization techniques for system identification problems has
interesting parallels to the ideas proposed in this paper. One of the key ideas espoused
in [17] is that, when the video sequence $y_{1:T}$ is an LDS, the block Hankel matrix
$\mathrm{Hank}(y_{1:T}, q)$ is low rank. When we have linear measurements of the video frames, we
can solve an affine-rank problem to recover the video. However, such methods optimize
on the Hankel matrix directly and lead to computationally infeasible designs even for
videos of very small dimensions. In contrast, CS-LDS has been shown to be fast and
computationally feasible for very large videos involving millions of variables. The key
is our two-step solution that isolates the space of unknowns into two manageable sets
and solves for each separately.

Universality: An attractive property of random matrix-based CS measurement is
the universality of the measurement process. Universality implies that the sensing
process is independent of the subsequent reconstruction algorithm. This makes the
sensing design "future-proof"; for such systems, if we devise a more sophisticated and
powerful recovery algorithm in the future, then we do not need to redesign the camera
or the sensing framework. The CS-LDS framework violates this property. The
two-step measurement process of Section 3, which is key to breaking the bilinearity
introduced by the LDS prior, implies that the CS-LDS design is not universal. An
intriguing direction for future research is the design of a universal CS-LDS measurement
process.

Online tracking: We have made the assumption of a static observation matrix
$C$. However, as the length of the video increases, the assumption of a static $C$ is
satisfied only by increasing the state space dimension. An alternate approach is to
allow for a time-varying observation matrix $C(t)$ and track it from the compressive
measurements. This would give us the benefit of a low state space dimension and yet
be accurate when we sense for long durations.

Beyond LDS: While the CS-LDS framework makes a compelling case study of LDSs
for video CS, its applicability to arbitrary videos is limited. In particular, it does not
extend to simple non-stationary scenes such as people walking or panning cameras (see
the result associated with Figure 6(h)). This motivates the search for models more
general than LDS. In this regard, a promising line of future research is to leverage
models from the video compression literature for CS recovery.